Abstract

In this tutorial we review the essential arguments behind entropic inference.
We focus on the epistemological notion of information and its relation to the
Bayesian beliefs of rational agents. The problem of updating from a prior to a
posterior probability distribution is tackled through an eliminative induction
process that singles out the logarithmic relative entropy as the unique tool
for inference. The resulting method of Maximum relative Entropy (ME) includes
as special cases both MaxEnt and Bayes' rule, and therefore unifies the two
themes of these workshops - the Maximum Entropy and the Bayesian methods -
into a single general inference scheme.

1 Introduction 

Our subject is inductive inference. Our goal in this tutorial paper is to review 
the problem of updating from a prior probability distribution to a posterior 
distribution when new information becomes available. 

First we tackle the question of the nature of information itself: What is 
information? It is clear that data "contains" or "conveys" information, but 
what does this precisely mean? Is information physical? We discuss how in a 
properly Bayesian framework one can usefully adopt a concept of information 
that is more directly related to the epistemological concerns of rational agents. 

Then we turn to the actual methods to process information. We argue for 
the uniqueness and universality of the Method of Maximum relative Entropy 
(ME) and then we discuss its relation to Bayesian methods. At first sight 
Bayesian and Maximum Entropy methods appear unrelated. Bayes' rule is the 
natural way to update probabilities when the new information is in the form 
of data. On the other hand, Jaynes' method of maximum entropy, MaxEnt, 
is designed to handle information in the form of constraints. An important
question is whether they are compatible with each other. We show that the
ME method includes both MaxEnt and Bayesian methods as special cases and
allows us to extend them to situations that lie beyond the reach of either of
them individually.

Finally we explore an important extension of the ME method. The distri- 
bution of maximum entropy has the highest probability of being the correct 
choice of posterior, but how justified are we in ruling out those distributions 
that do not maximize the entropy? The extended ME assigns a probability to 
those other distributions and this has a wide variety of applications: it provides 
a connection to the theory of large deviations, to fluctuation theory, to entropic 
priors, and most recently to quantum mechanics. The possibilities are endless. 

We make no attempt to provide a review of the literature on entropic infer- 
ence. The following list, which reflects only some contributions that are directly 
related to the particular approach described in this tutorial, is incomplete but 
might nevertheless be useful: Jaynes [1], Shore and Johnson [2], Williams [3],
Skilling [4], Rodriguez [5] [6], Giffin and Caticha [7]-[12].

2 What is information? 

The expression that systems "carry" or "contain" information can perhaps be 
traced to Shannon's theory of communication: a system is analogous to a mes- 
sage. The system "carries" information about its own state and, in this sense, 
one can say that information is physical. Such physical information is directly 
associated to the system. Our interest here is in an altogether different notion 
of information which we might call epistemological and which is directly as- 
sociated to the beliefs of rational agents. Indeed, any fully Bayesian theory of 
information requires an explicit account of how such epistemological information 
is related to rational beliefs. 

The need to update from one state of belief to another is driven by the 
conviction that not all probability assignments are equally good; some beliefs 
are preferable to others in the very pragmatic sense that they enhance our 
chances to successfully navigate this world. The idea is that, to the extent 
that we wish to be called rational, we will improve our beliefs by revising them 
when new information becomes available: Information is what forces a change 
of rational beliefs. Or, to put it more explicitly: Information is a constraint on 
rational beliefs. 

This definition - information is a constraint - is sufficient for our present 
purposes but would benefit from further elaboration. The definition captures a 
notion of information that is directly related to changing our minds: information 
is the driving force behind the process of learning. It incorporates an important 
feature of rationality: being rational means accepting that our beliefs must 
be constrained in very specific ways - not everything goes. But surely this 
is not enough: the indiscriminate acceptance of any arbitrary constraint does 
not qualify as rational behavior. To be rational an agent must exercise some 
judgement before accepting a particular piece of information as a reliable basis 
for the revision of its beliefs and this raises questions about what judgements 
might be considered sound. Indeed, there is no implication that the information
must be true; only that we accept it as true. False information is information
too, at least as long as we are prepared to accept it and allow it to affect our 
beliefs. 

The paramount virtue of our definition is that it is useful. It allows precise 
quantitative calculations even though the notion of an amount of information, 
whether measured in bits or otherwise, is not introduced. By information in 
its most general form, we just mean the set of constraints on the family of 
acceptable posterior distributions and this is precisely the kind of information 
the method of maximum entropy is designed to handle. 

3 Updating probabilities: the ME method 

The uncertainty about a variable $x \in \mathcal{X}$ (whether discrete or continuous, in one
or several dimensions) is described by a probability distribution $q(x)$. Our goal
is to design a method to update from a prior distribution $q(x)$ to a posterior
distribution $P(x)$ when new information in the form of constraints becomes
available. (The constraints can be given in terms of expected values but this is
not necessary. Other types of constraints are allowed too; an example appears
in section 5.)

The problem is to select a distribution from among all those that satisfy 
the constraints. The procedure is to rank the candidate distributions in order 
of increasing preference [4]. It is clear that to accomplish our goal the ranking
must be transitive: if distribution $p_1$ is preferred over $p_2$, and $p_2$ is preferred
over $p_3$, then $p_1$ is preferred over $p_3$. Such transitive rankings are implemented
by assigning to each $p(x)$ a real number $S[p]$ in such a way that if $p_1$ is preferred
over $p_2$, then $S[p_1] > S[p_2]$. The selected distribution $P$ (one or possibly many,
for on the basis of the available information we might have several equally 
preferred distributions) will be that which maximizes the quantity S[p], which 
we will henceforth call entropy. We are thus led to a method of Maximum 
Entropy (ME) that involves entropies that are real numbers and that are meant 
to be maximized. These features are imposed by design; they are dictated by 
the function that the ME method is being designed to perform and not by any 
objective properties of the external world. 

Next we must make a definite choice for the functional $S[p]$. Since the
purpose of the method is to update from priors to posteriors the ranking scheme
must depend on the particular prior $q$ and therefore the entropy $S$ must be a
functional of both $p$ and $q$. Thus the entropy $S[p, q]$ produces a ranking of the
distributions $p$ relative to the given prior $q$: $S[p, q]$ is the entropy of $p$ relative to
$q$. Accordingly $S[p, q]$ is commonly called relative entropy, but since all entropies
are relative, even when relative to a uniform distribution, the modifier 'relative'
is redundant and can be dropped.

The functional S[p, q] is selected by a process of eliminative induction. The 
idea is simple: we start with a sufficiently broad family of candidates and identify 
a number of special cases for which we know what the preferred distribution 
ought to be. Then we just eliminate all those candidates that fail to provide the
right update. As we shall see the selection criteria adopted below are sufficiently
constraining that there is a single entropy functional $S[p, q]$ that survives the
process of elimination.

This approach has a number of virtues. First, to the extent that the selection 
criteria are universally desirable, then the single surviving entropy functional 
will be of universal applicability too. Second, the reason why any entropy 
candidate is eliminated is quite explicit - at least one of the selection criteria is 
violated. Thus, the justification behind the single surviving entropy is not that 
it leads to demonstrably correct inferences, but rather, that other entropies are 
demonstrably wrong. 

The selection criteria are chosen to reflect the conviction that information 
collected in the past and codified into the prior distribution is valuable and 
should not be ignored. This attitude is very conservative: the only aspects of 
one's beliefs that should be updated are those for which new evidence has been 
supplied. Moreover, as we shall see below, the selection criteria merely tell us 
what not to update, which has the virtue of maximizing objectivity - there are 
many ways to change something but only one way to keep it the same. These 
ideas are summarized in the following 

Principle of Minimal Updating (PMU): Beliefs must be revised only to 
the extent required by the new information. 

Three selection criteria, a brief motivation for them, and their consequences 
for the functional form of the entropy are listed below (proofs and more details 
are given in [12]). The reason these criteria are so constraining is that they 
refer to three infinitely large classes of special cases where the desired update is 
known. 

Criterion 1: Locality. Local information has local effects. 
If the information to be processed does not refer to an $x$ in a particular
subdomain $\mathcal{D} \subset \mathcal{X}$ then the PMU requires that we do not change our minds about
$x \in \mathcal{D}$. More precisely, we require that the prior conditioned on $\mathcal{D}$ is not
updated. The selected posterior is such that $P(x|\mathcal{D}) = q(x|\mathcal{D})$. Dropping additive
terms and multiplicative factors that do not affect the overall ranking, the
surviving entropy functionals are of the form

$$S[p,q] = \int dx\, F\big(p(x), q(x), x\big) , \tag{1}$$

where $F$ is some unknown function and by $\int dx$ we mean a discrete sum or
continuous integral (possibly over several dimensions) as the case might require.

Criterion 2: Coordinate invariance. The system of coordinates carries 
no information. 

The points x can be labeled in different ways using different coordinate systems 
but this should not affect the ranking of the distributions. The consequence of 
criterion 2 is that the surviving entropies can be written as 

$$S[p,q] = \int dx\, m(x)\, \Phi\!\left(\frac{p(x)}{m(x)}, \frac{q(x)}{m(x)}\right) , \tag{2}$$

where $m(x)$ is a probability density, which implies that $dx\, m(x)$, $p(x)/m(x)$, and
$q(x)/m(x)$ are coordinate invariants. (Again, additive terms and multiplicative
factors that do not affect the overall ranking have been dropped.) We see that
the single unknown function $F$ in (1) with three arguments has been replaced
by two unknown functions. One is the density $m(x)$, and the other is a function
$\Phi$ with two arguments. The density $m(x)$ is determined by invoking the locality
criterion once again.

Criterion 1 (a special case): When there is no new information there is
no reason to change one's mind.

When no new information is available the domain $\mathcal{D}$ in criterion 1 coincides with
the whole space $\mathcal{X}$. The conditional probabilities $q(x|\mathcal{D}) = q(x|\mathcal{X}) = q(x)$ should
not be updated and the selected posterior coincides with the prior, $P(x) = q(x)$.
The consequence is that up to normalization the unknown $m(x)$ must be the
prior distribution $q(x)$. The entropy is now restricted to functionals of the form

$$S[p,q] = \int dx\, q(x)\, \Phi\!\left(\frac{p(x)}{q(x)}\right) . \tag{3}$$

Criterion 3: Independence. When systems are known to be independent 
it should not matter whether they are treated separately or jointly. 
The preservation of independence is a particularly important concern for science 
because without it science is not possible. The reason is that in any inference it 
is assumed that the universe is partitioned into the system of interest and other 
systems that constitute the rest of the universe. What is important about those 
other systems is precisely that they can be ignored - whether they are included 
in the analysis or not should make no difference. If they did matter they should 
have been incorporated as part of the system of interest in the first place. 

It is crucial that Criterion 3 be applied to all independent systems whether 
they are identical or not, whether just two or many, or even infinitely many. This 
criterion is sufficiently constraining that (up to additive terms and multiplicative 
factors that do not affect the overall ranking scheme) there is a single surviving 
entropy functional given by the usual logarithmic relative entropy [12],

$$S[p,q] = -\int dx\, p(x) \log \frac{p(x)}{q(x)} . \tag{4}$$

These results are summarized as follows: 

The ME method: The objective is to update from a prior distribution
$q$ to a posterior distribution $P$ given the information that the posterior lies
within a certain family of distributions $p$. The selected posterior $P$ is that which
maximizes the entropy $S[p, q]$. Since prior information is valuable the functional
$S[p, q]$ is chosen so that beliefs are updated only to the minimal extent required by
the new information. No interpretation for $S[p, q]$ is given and none is needed.
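
As an illustration of the ME method on a discrete space, the following Python
sketch updates a prior under a single expected-value constraint. The die
example, the names `relative_entropy` and `me_update`, and the use of bisection
to find the Lagrange multiplier are assumptions made for this illustration; they
are not taken from the text above.

```python
import math

def relative_entropy(p, q):
    """S[p, q] = -sum_i p_i log(p_i / q_i); cells with p_i = 0 contribute 0."""
    return -sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def me_update(q, f, F, lo=-50.0, hi=50.0, tol=1e-12):
    """Maximize S[p, q] subject to normalization and sum_i p_i f_i = F.

    The maximizer has the form p_i = q_i exp(lam * f_i) / Z. The expected
    value of f increases monotonically with lam, so bisection finds lam.
    """
    def tilted(lam):
        w = [qi * math.exp(lam * fi) for qi, fi in zip(q, f)]
        z = sum(w)
        return [wi / z for wi in w]

    def mean(lam):
        return sum(pi * fi for pi, fi in zip(tilted(lam), f))

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) < F:
            lo = mid
        else:
            hi = mid
    return tilted(0.5 * (lo + hi))

# Hypothetical example: a die with a uniform prior and a constrained mean.
q = [1.0 / 6.0] * 6
f = [1, 2, 3, 4, 5, 6]
posterior = me_update(q, f, 4.5)
```

If the constraint is already satisfied by the prior (here a mean of 3.5), the
sketch returns the prior unchanged, in line with the Principle of Minimal
Updating.
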
4 Bayes' rule and its generalizations



Bayes' rule is used to make inferences about one or several quantities $\theta \in \Theta$
on the basis of information in the form of data $x \in \mathcal{X}$. More specifically, the
problem is to update our beliefs about $\theta$ on the basis of three pieces of
information: (1) the prior information codified into a prior distribution $q(\theta)$; (2) the
data $x \in \mathcal{X}$ (obtained in one or many experiments); and (3) the known relation
between $\theta$ and $x$ given by the model as defined by the sampling distribution or
likelihood function, $q(x|\theta)$. The updating consists of replacing the prior
probability distribution $q(\theta)$ by a posterior distribution $P(\theta)$ that applies after the
data has been processed.

Remark: We emphasize that the information about how $x$ is related to $\theta$ is
contained in the functional form of the distribution $q(\cdot|\theta)$ which is completely
unrelated to the actual values of the observed data.

The insight that will allow Bayes' rule to be smoothly incorporated into the
entropic inference framework [3] [9] is that the relevant universe of discourse is
not $\Theta$ but the product space $\Theta \times \mathcal{X}$ [5] [6]. We deal with joint distributions and
the relevant joint prior is $q(x, \theta) = q(\theta)\, q(x|\theta)$.
Remark: Bayes' rule is usually written in the form

$$q(\theta|x) = q(\theta)\, \frac{q(x|\theta)}{q(x)} , \tag{5}$$

and called Bayes' theorem. This formula is a restatement of the product rule.
It is valid for any value of $x$ whether it coincides with the observed data or
not and therefore it is a simple consequence of the internal consistency of the
prior beliefs. Within the framework of entropic inference the left hand side is
not a posterior but rather a prior probability - it is the prior probability of $\theta$
conditional on $x$.

Next we collect data and the observed values turn out to be $x'$. This
constrains the posterior to the family of distributions $p(x, \theta)$ defined by

$$p(x) = \int d\theta\, p(\theta, x) = \delta(x - x') . \tag{6}$$

This data information is not, however, sufficient to determine the joint
distribution

$$p(x, \theta) = p(x)\, p(\theta|x) = \delta(x - x')\, p(\theta|x') . \tag{7}$$

Any choice of $p(\theta|x')$ is in principle possible. Within the framework of entropic
inference (see [9]) the joint posterior $P(x,\theta)$ is the minimal update from the
prior $q(x, \theta)$ that agrees with the data constraint. To find it maximize the
entropy,

$$S[p, q] = -\int dx\, d\theta\, p(x, \theta) \log \frac{p(x, \theta)}{q(x, \theta)} , \tag{8}$$

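subject to the infinite number of constraints given by eq. (6). Note that there
is one constraint for each value of $x$. The corresponding Lagrange multipliers
are denoted $\lambda(x)$. Maximizing (8) subject to (6) and normalization,

$$\delta \left\{ S + \alpha \left[ \int dx\, d\theta\, p(x, \theta) - 1 \right] + \int dx\, \lambda(x) \left[ \int d\theta\, p(x, \theta) - \delta(x - x') \right] \right\} = 0 , \tag{9}$$

yields

$$P(x, \theta) = q(x, \theta)\, \frac{e^{\lambda(x)}}{Z} , \tag{10}$$

where $Z$ is a normalization constant, and $\lambda(x)$ is determined from (6),

$$\int d\theta\, q(x, \theta)\, \frac{e^{\lambda(x)}}{Z} = q(x)\, \frac{e^{\lambda(x)}}{Z} = \delta(x - x') , \tag{11}$$

so that the joint posterior is

$$P(x, \theta) = q(x, \theta)\, \frac{\delta(x - x')}{q(x)} = \delta(x - x')\, q(\theta|x) . \tag{12}$$

The corresponding marginal posterior probability $P(\theta)$ is

$$P(\theta) = \int dx\, P(\theta, x) = q(\theta|x') = q(\theta)\, \frac{q(x'|\theta)}{q(x')} , \tag{13}$$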

which coincides with Bayes' rule. This is intuitively reasonable: we maintain
those beliefs about $\theta$ that are consistent with the data values $x'$ that turned out
to be true. Data values that were not observed are discarded because they are
now known to be false. The extension to repeatable independent experiments
is straightforward [12].
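
The derivation of Bayes' rule from the data constraint can be checked
numerically on a small discrete example. The sketch below is only an
illustration: the two-hypothesis coin model, the probabilities, and the helper
names are hypothetical. It builds the joint prior, applies eq. (12), and
compares the entropy of the result with that of a rival joint distribution
that also satisfies the data constraint.

```python
import math
import random

# Hypothetical coin model, used only for illustration.
prior_theta = {"fair": 0.7, "biased": 0.3}              # q(theta)
likelihood = {"fair": {"H": 0.5, "T": 0.5},
              "biased": {"H": 0.9, "T": 0.1}}           # q(x|theta)

# Joint prior q(x, theta) = q(theta) q(x|theta) on the product space.
q_joint = {(x, t): prior_theta[t] * likelihood[t][x]
           for t in prior_theta for x in ("H", "T")}

def entropy(p, q):
    """S[p, q] = -sum p log(p/q), skipping cells where p vanishes."""
    return -sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)

def me_joint_posterior(q_joint, x_obs):
    """Discrete analogue of eq. (12): keep the observed x, renormalize by q(x')."""
    q_x = sum(v for (x, _), v in q_joint.items() if x == x_obs)
    return {(x, t): (v / q_x if x == x_obs else 0.0)
            for (x, t), v in q_joint.items()}

def marginal_theta(p_joint):
    """Sum the joint distribution over x to obtain P(theta)."""
    out = {}
    for (_, t), v in p_joint.items():
        out[t] = out.get(t, 0.0) + v
    return out

x_obs = "H"
P_joint = me_joint_posterior(q_joint, x_obs)
P_theta = marginal_theta(P_joint)

# A rival joint distribution with an arbitrary p(theta|x'); it satisfies the
# same data constraint but should not have a higher entropy.
random.seed(0)
r = random.random()
rival = {key: 0.0 for key in q_joint}
rival[(x_obs, "fair")] = r
rival[(x_obs, "biased")] = 1.0 - r
```

In this sketch `P_theta` agrees with the usual Bayes posterior computed from
`prior_theta` and `likelihood`, and `entropy(P_joint, q_joint)` is at least as
large as `entropy(rival, q_joint)`.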

Next I give a couple of very simple examples that show how entropic methods 
allow generalizations of Bayes' rule. 

Example 1. Jeffrey's rule. As before, the prior information consists of our
prior knowledge about $\theta$ given by the distribution $q(\theta)$ and the relation between
$x$ and $\theta$ is given by the likelihood $q(x|\theta)$. But now the information about $x$ is
limited because the data is uncertain. The marginal posterior $p(x)$ is no longer
a sharp delta function but some other known distribution, $p(x) = P_D(x)$. This
is still an infinite number of constraints

$$p(x) = \int d\theta\, p(\theta, x) = P_D(x) . \tag{14}$$

Maximizing (8) subject to (14) and normalization leads to

$$P(x, \theta) = P_D(x)\, q(\theta|x) . \tag{15}$$

The corresponding marginal posterior,

$$P(\theta) = \int dx\, P_D(x)\, q(\theta|x) = q(\theta) \int dx\, P_D(x)\, \frac{q(x|\theta)}{q(x)} , \tag{16}$$

is known as Jeffrey's rule. In the limit when the data are sharply determined
$P_D(x) = \delta(x - x')$ the posterior reproduces Bayes' rule (13).
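
Jeffrey's rule, eq. (16), is easy to try out on a discrete example. The sketch
below is an illustration under assumed inputs: the coin model and the names
`evidence`, `bayes`, and `jeffrey` are hypothetical. It averages the
conditional distributions $q(\theta|x)$ with the weights $P_D(x)$.

```python
# Hypothetical coin model, used only for illustration.
prior_theta = {"fair": 0.7, "biased": 0.3}              # q(theta)
likelihood = {"fair": {"H": 0.5, "T": 0.5},
              "biased": {"H": 0.9, "T": 0.1}}           # q(x|theta)

def evidence(x):
    """q(x) = sum_theta q(theta) q(x|theta)."""
    return sum(prior_theta[t] * likelihood[t][x] for t in prior_theta)

def bayes(x):
    """q(theta|x) from the product rule, eq. (5)."""
    return {t: prior_theta[t] * likelihood[t][x] / evidence(x)
            for t in prior_theta}

def jeffrey(p_data):
    """Eq. (16): P(theta) = sum_x P_D(x) q(theta|x)."""
    post = {t: 0.0 for t in prior_theta}
    for x, weight in p_data.items():
        if weight > 0:
            for t, value in bayes(x).items():
                post[t] += weight * value
    return post

soft = jeffrey({"H": 0.8, "T": 0.2})    # uncertain data
sharp = jeffrey({"H": 1.0, "T": 0.0})   # sharply determined data
```

With sharply determined data the result coincides with `bayes("H")`, and when
$P_D(x)$ equals the prior marginal $q(x)$ the prior `prior_theta` is returned
unchanged.
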
Example 2. Unknown likelihood. The following example derives and
generalizes Zellner's Bayesian Method of Moments [13]. Usually the relation
between $x$ and $\theta$ is given by a known likelihood function $q(x|\theta)$ but suppose this
relation is not known. This is the case when the joint prior is so ignorant that
information about $x$ tells us nothing about $\theta$ and vice versa; such a prior treats
$x$ and $\theta$ as statistically independent, $q(x,\theta) = q(x)\, q(\theta)$. Since we have no
likelihood function the information about the relation between $\theta$ and the data $x$
must be supplied elsewhere. One possibility is through a constraint. Suppose
that in addition to normalization and the uncertain data constraint, eq. (14), we
also know that the expected value over $\theta$ of a function $f(x, \theta)$ is

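$$\langle f \rangle_x = \int d\theta\, p(\theta|x)\, f(x, \theta) = F(x) . \tag{17}$$

We seek a posterior $P(x, \theta)$ that maximizes (8). Introducing Lagrange
multipliers $\alpha$, $\lambda(x)$, and $\gamma(x)$,

\begin{align}
0 = \delta \Big\{ & S + \alpha \Big[ \int dx\, d\theta\, p(x, \theta) - 1 \Big] + \int dx\, \lambda(x) \Big[ \int d\theta\, p(x, \theta) - P_D(x) \Big] \tag{18} \\
& + \int dx\, \gamma(x) \Big[ \int d\theta\, p(x, \theta)\, f(x, \theta) - P_D(x)\, F(x) \Big] \Big\} , \tag{19}
\end{align}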

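the variation over $p(x, \theta)$ yields

$$P(x, \theta) = \frac{1}{\zeta}\, q(x)\, q(\theta)\, e^{\lambda(x) + \gamma(x) f(x, \theta)} , \tag{20}$$

where $\zeta$ is a normalization constant. The multiplier $\lambda(x)$ is determined from

$$P(x) = \int d\theta\, P(\theta, x) = \frac{1}{\zeta}\, q(x)\, e^{\lambda(x)} \int d\theta\, q(\theta)\, e^{\gamma(x) f(x, \theta)} = P_D(x) , \tag{21}$$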

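then,

$$P(x, \theta) = P_D(x)\, \frac{q(\theta)\, e^{\gamma(x) f(x, \theta)}}{\int d\theta'\, q(\theta')\, e^{\gamma(x) f(x, \theta')}} , \tag{22}$$

so that

$$P(\theta|x) = \frac{P(x, \theta)}{P(x)} = q(\theta)\, \frac{e^{\gamma(x) f(x, \theta)}}{Z(x)} \quad \text{with} \quad Z(x) = \int d\theta'\, q(\theta')\, e^{\gamma(x) f(x, \theta')} . \tag{23}$$

The multiplier $\gamma(x)$ is determined from (17),

$$\frac{1}{Z(x)}\, \frac{\partial Z(x)}{\partial \gamma(x)} = F(x) . \tag{24}$$

The corresponding marginal posterior is

$$P(\theta) = \int dx\, P_D(x)\, P(\theta|x) = q(\theta) \int dx\, P_D(x)\, \frac{e^{\gamma(x) f(x, \theta)}}{Z(x)} . \tag{25}$$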

In the limit when the data are sharply determined $P_D(x) = \delta(x - x')$ the
posterior takes the form of Bayes' theorem,

$$P(\theta) = q(\theta)\, \frac{e^{\gamma(x') f(x', \theta)}}{Z(x')} , \tag{26}$$

where up to a normalization factor $e^{\gamma(x') f(x', \theta)}$ plays the role of the likelihood
and the normalization constant $Z$ plays the role of the evidence.
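
For sharply determined data the unknown-likelihood update reduces to solving
eq. (24) for a single multiplier and evaluating eq. (26). The following sketch
does this on a grid of $\theta$ values. Everything specific in it is an
assumption made for illustration: the Gaussian-shaped prior, the choice
$f(x', \theta) = \theta$, the target value $F(x') = 0.8$, and the function names.

```python
import math

def tilted_posterior(prior, f_vals, gamma):
    """Eq. (26) on a grid: P(theta) = q(theta) exp(gamma f) / Z; also returns Z."""
    w = [q * math.exp(gamma * f) for q, f in zip(prior, f_vals)]
    z = sum(w)
    return [wi / z for wi in w], z

def solve_gamma(prior, f_vals, F, lo=-20.0, hi=20.0, tol=1e-12):
    """Solve eq. (24), (1/Z) dZ/dgamma = F, by bisection.

    The left hand side is the posterior expected value of f, which increases
    monotonically with gamma.
    """
    def mean_f(gamma):
        post, _ = tilted_posterior(prior, f_vals, gamma)
        return sum(p * f for p, f in zip(post, f_vals))

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_f(mid) < F:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical setup: theta on a grid with a Gaussian-shaped prior q(theta).
thetas = [i / 10.0 for i in range(-30, 31)]
raw = [math.exp(-0.5 * t * t) for t in thetas]
prior = [r / sum(raw) for r in raw]

f_vals = thetas          # assumed constraint function f(x', theta) = theta
F_target = 0.8           # assumed value of F(x')

gamma = solve_gamma(prior, f_vals, F_target)
posterior, Z = tilted_posterior(prior, f_vals, gamma)
```

The factor `exp(gamma * f)` plays the role of the likelihood and `Z` that of
the evidence, as in the discussion of eq. (26).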

In conclusion, these examples demonstrate that the method of maximum en- 
tropy can fully reproduce the results obtained by the standard Bayesian methods 
and allows us to extend them to situations that lie beyond their reach such as 
when the likelihood function is not known. Other such examples are given in 
[11] and [12]. 

5 Deviations from maximum entropy 

To complete the design of the ME method we must address one last issue. Once 
we have decided that the distribution that maximizes entropy is to be preferred 
over all others we ask: to what extent are the other distributions ruled out? 
The discussion below follows [8] [12] . 

The original problem was to update from a prior $q(x)$ given constraints
that define the space $\Theta$ of acceptable distributions. We assume that these
distributions, that is, the "points" in the space $\Theta$, can be labelled by coordinates
$\theta$. Thus, $\Theta$ is a statistical manifold and its points can be written as $p(x|\theta)$.
Maximizing $S[p,q]$ over all the $p(x|\theta)$ in $\Theta$ leads to the preferred distribution,
say $p(x|\theta_0)$.

The question about the extent that distributions with $\theta \neq \theta_0$ are ruled out
is a question about the probability of various values of $\theta$: to what extent do
we believe that the selected value should lie within any particular range $d\theta$?
Thus we are not just concerned with the probability of $x$, but with the joint
distribution $p(x, \theta)$. To assign $p(x, \theta)$ we apply the same ME method but in the
larger joint space: maximize the joint entropy

$$S[p, q] = -\int dx\, d\theta\, p(x, \theta) \log \frac{p(x, \theta)}{q(x, \theta)} , \tag{27}$$

for a suitable prior $q(x, \theta)$ and under the appropriate constraints.

Choosing a prior is always tricky because it represents what we knew before 
the relevant new information became available. We want to represent a state of 
extreme ignorance: the precise relation between $\theta$s and $x$s is not (yet) known
and therefore $q(x, \theta)$ is a product, $q(x, \theta) = q(x)\, q(\theta)$, so that knowing $x$ tells us
nothing about $\theta$ and vice versa. For $q(x)$ we retain the prior used in the original
problem where we updated from $q(x)$ to $p(x|\theta_0)$.

For $q(\theta)$ we plead ignorance once again and choose a uniform distribution.
This is somewhat trickier than may seem at first sight because uniform does
not mean constant. The uniform distribution assigns equal probabilities to equal
volumes in $\Theta$ and does not depend on the particular choice of coordinates. (A
constant distribution, on the other hand, depends on the choice of coordinates:
a distribution that is constant in one coordinate frame will not be constant
in another.) This requires a well-defined notion of volume. Fortunately, the
statistical manifold $\Theta$ is a metric space: there is a single unique geometry that
properly takes into account the fact that the points in $\Theta$ are not structureless
points but are actual probability distributions. This is given by the Fisher-Rao
information metric, $g_{ij}(\theta)$ [14] [12]. The corresponding volume elements are
given by $g^{1/2}(\theta)\, d^n\theta$, where $g = \det g_{ij}$. Therefore the uniform (unnormalized)
prior is $q(\theta) = g^{1/2}(\theta)$ and the joint prior is $q(x,\theta) = g^{1/2}(\theta)\, q(x)$.

The crucial constraint on the joint distributions $p(x, \theta) = p(\theta)\, p(x|\theta)$ specifies
the conditional distributions $p(x|\theta)$. This amounts to selecting the particular
space $\Theta$ under consideration.

The preferred joint distribution $P(x,\theta)$ is that which maximizes the joint
entropy $S[p, q]$ over all normalized distributions of the form $p(x, \theta) = p(\theta)\, p(x|\theta)$
where we vary with respect to $p(\theta)$ and restrict to $p(x|\theta) \in \Theta$. It is convenient
to rewrite (27) as

$$S[p, q] = -\int d\theta\, p(\theta) \log \frac{p(\theta)}{g^{1/2}(\theta)} + \int d\theta\, p(\theta)\, S(\theta) , \tag{28}$$

where

$$S(\theta) = -\int dx\, p(x|\theta) \log \frac{p(x|\theta)}{q(x)} . \tag{29}$$

The result is the probability that $\theta$ lies within a small volume $g^{1/2}(\theta)\, d^n\theta$,

$$P(\theta)\, d^n\theta = \frac{1}{\zeta}\, e^{S(\theta)}\, g^{1/2}(\theta)\, d^n\theta \quad \text{with} \quad \zeta = \int d^n\theta\, g^{1/2}(\theta)\, e^{S(\theta)} . \tag{30}$$



The preferred value of $\theta$ is that $\theta_0$ which maximizes the entropy $S(\theta)$, eq. (29),
because this maximizes the scalar probability density $\exp S(\theta)$. But it also tells
us the degree to which values of $\theta$ away from the maximum are ruled out.
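
To make eq. (30) concrete, the sketch below evaluates it for a one-parameter
family. The setting is an assumption chosen for simplicity: a binary variable
with $p(x|\theta) = (1 - \theta, \theta)$, for which the Fisher-Rao metric is
$g = 1/[\theta(1 - \theta)]$, a hypothetical prior $q(x) = (0.7, 0.3)$, and a
midpoint grid on $(0, 1)$ for the normalization integral. The function names are
illustrative.

```python
import math

def S(theta, q):
    """Eq. (29) for a binary variable with p(x|theta) = (1 - theta, theta)."""
    p = (1.0 - theta, theta)
    return -sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sqrt_g(theta):
    """g^{1/2} for the Bernoulli family, where g = 1 / (theta (1 - theta))."""
    return 1.0 / math.sqrt(theta * (1.0 - theta))

def entropic_density(q, n=2000):
    """Eq. (30) on a midpoint grid: P(theta) = exp(S) g^{1/2} / zeta."""
    h = 1.0 / n
    grid = [(k + 0.5) * h for k in range(n)]
    weights = [math.exp(S(t, q)) * sqrt_g(t) for t in grid]
    zeta = sum(weights) * h          # midpoint-rule estimate of the integral
    return grid, [w / zeta for w in weights], h

q = (0.7, 0.3)                       # assumed prior: q(x=0) = 0.7, q(x=1) = 0.3
grid, density, h = entropic_density(q)

# The preferred value theta_0 maximizes S(theta), i.e. the scalar density exp S.
theta_0 = max(grid, key=lambda t: S(t, q))
```

In this example `theta_0` falls next to 0.3, the value for which $p(x|\theta)$
coincides with the prior. Note that the coordinate density `density` also
contains the volume factor $g^{1/2}(\theta)$, so it is the scalar density
$\exp S(\theta)$, not `density` itself, that peaks at `theta_0`.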

One of the limitations of the standard MaxEnt method is that it selects a
single "posterior" $p(x|\theta_0)$ and all other distributions are strictly ruled out. The
result (30) overcomes this limitation and finds many applications. For example,
it extends the Einstein theory of thermodynamic fluctuations beyond the regime 
of small fluctuations; it provides a bridge to the theory of large deviations; and, 
suitably adapted for Bayesian data analysis, it leads to the notion of entropic 
priors. 



6 Conclusions 

Any Bayesian account of the notion of information cannot ignore the fact that 
Bayesians are concerned with the beliefs of rational agents. The relation be- 
tween information and beliefs must be clearly spelled out. The definition we 
have proposed - that information is that which constrains rational beliefs and 
therefore forces the agent to change its mind - is convenient for two reasons. 
First, the information/belief relation is explicit, and second, the definition is 
ideally suited for quantitative manipulation using the ME method. 

The main conclusion is that the logarithmic relative entropy is the only 
candidate for a general method for updating probabilities - the ME method 
- and this includes both MaxEnt and Bayes' rule as special cases; it unifies
them into a single theory of inductive inference and allows new applications.
Indeed, much as the old MaxEnt method provided the foundation for statistical 
mechanics, recent work suggests that the extended ME method provides an 
entropic foundation for quantum mechanics. 